 learning rate scheduler



Appendix: On the Overlooked Pitfalls of Weight Decay and How to Mitigate Them

Neural Information Processing Systems

Suppose we have a non-zero solution θ which is a stationary point of f(θ,t) at the t-th step, and SGD finds θt = θ at the t-th step. Theorem 2.2 of Shapiro and Wardi [9] tells us that the learning rate should be small enough for convergence. Obviously, we have η < in practice. As ηt = ηt+1 does not hold, SGD cannot converge to any non-zero stationary point. The proof is now complete.



Textless NLP -- Zero Resource Challenge with Low Resource Compute

arXiv.org Artificial Intelligence

The availability of text data for low-resource languages has always been a challenge and transfer learning from multilingual models has its own limitations. End-to-End spoken systems without involving text have received significant attention in the recent years. The Zero-Resource challenge (ZRC) [1] has enabled addressing the low-resource language representation problem and has been a significant driver in this area. In the acoustic unit discovery task for ZRC, high-dimensional input speech data is mapped to its latent representation to ... Coding (VQ-CPC) [8] as the encoder in our speech processing pipeline. The input audio files are preprocessed and extracted as log-Mel spectrograms. The initial processing involves convolution and normalization layers to extract high-level features. These features are then passed through an auto-regressive network, which predicts future representations of the input based on past information. One of the key characteristics of VQ-CPC is its use of vector quantization as a bottleneck to discretize the continuous embeddings extracted by the autoregressive network into a finite set of discrete codes.
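The vector quantization bottleneck mentioned in this excerpt can be pictured as a nearest-neighbour lookup in a codebook. The sketch below is only an illustration under stated assumptions: the 2-D codebook, the function names `quantize` and `discretize`, and the Euclidean distance are hypothetical choices, not details taken from the paper or from VQ-CPC itself.

```python
import math

# Hypothetical 2-D codebook with three entries (an assumption for illustration;
# a trained model would learn many higher-dimensional codes).
CODEBOOK = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]


def quantize(vector, codebook=CODEBOOK):
    """Return (index, code) of the codebook entry closest to `vector`."""
    best_index = min(
        range(len(codebook)),
        key=lambda i: math.dist(vector, codebook[i]),  # Euclidean distance
    )
    return best_index, codebook[best_index]


def discretize(frames, codebook=CODEBOOK):
    """Map a sequence of continuous embeddings to discrete code indices."""
    return [quantize(frame, codebook)[0] for frame in frames]


if __name__ == "__main__":
    embeddings = [(0.1, -0.2), (0.9, 1.2), (-0.8, 1.7)]
    print(discretize(embeddings))
```

Each continuous embedding is replaced by the index of its nearest code, so a sequence of frames becomes a sequence of integers drawn from a finite set.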


Probabilistic learning rate scheduler with provable convergence

arXiv.org Artificial Intelligence

Learning rate schedulers have shown great success in speeding up the convergence of learning algorithms in practice. However, their convergence to a minimum has not been proven theoretically. This difficulty mainly arises from the fact that, while traditional convergence analysis prescribes monotonically decreasing (or constant) learning rates, schedulers opt for rates that often increase and decrease through the training epochs. In this work, we aim to bridge the gap by proposing a probabilistic learning rate scheduler (PLRS) that does not conform to the monotonically decreasing condition, with provable convergence guarantees. In addition to providing detailed convergence proofs, we also show experimental results where the proposed PLRS performs competitively with other state-of-the-art learning rate schedulers across a variety of datasets and architectures.


Automatic gradient descent with generalized Newton's method

arXiv.org Artificial Intelligence

We propose the generalized Newton's method (GeN) -- a Hessian-informed approach that applies to any optimizer such as SGD and Adam, and covers the Newton-Raphson method as a sub-case. Our method automatically and dynamically selects the learning rate that accelerates the convergence, without the intensive tuning of the learning rate scheduler. In practice, our method is easily implementable, since it only requires additional forward passes with almost zero computational overhead (in terms of training time and memory cost), if the overhead is amortized over many iterations. We present extensive experiments on language and vision tasks (e.g. GPT and ResNet) to showcase that GeN optimizers match the state-of-the-art performance, which was achieved with carefully tuned learning rate schedulers. Code to be released at https://github.com/ShiyunXu/AutoGeN.
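One way to picture choosing a learning rate from extra forward passes alone is to fit a parabola to the loss along the update direction and step to its minimum. The toy sketch below does this in one dimension; it is an assumption made for illustration, not the authors' GeN implementation, and the names `fit_quadratic_step` and `loss_at` are hypothetical.

```python
def fit_quadratic_step(loss_at, eta):
    """Pick a step size from three loss evaluations (forward passes only).

    `loss_at(s)` is assumed to return the loss after moving a distance `s`
    along the negative gradient. A parabola a*s^2 + b*s + c is fitted through
    s = -eta, 0, +eta and its minimizer -b / (2a) is returned.
    """
    l_minus, l_zero, l_plus = loss_at(-eta), loss_at(0.0), loss_at(eta)
    a = (l_plus + l_minus - 2.0 * l_zero) / (2.0 * eta ** 2)
    b = (l_plus - l_minus) / (2.0 * eta)
    if a <= 0.0:
        # No positive curvature along this direction: keep the probe step size.
        return eta
    return -b / (2.0 * a)


if __name__ == "__main__":
    # Toy 1-D loss with minimum at w = 2 and second derivative 6.
    def loss(w):
        return 3.0 * (w - 2.0) ** 2

    w, grad = 0.0, 6.0 * (0.0 - 2.0)  # gradient of the toy loss at w = 0
    step = fit_quadratic_step(lambda s: loss(w - s * grad), eta=0.01)
    print(step, w - step * grad)
```

On this quadratic toy loss the fitted step equals the inverse of the second derivative, which is the Newton-Raphson step, so a single update lands on the minimum.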


Cyclical Log Annealing as a Learning Rate Scheduler

arXiv.org Artificial Intelligence

A learning rate scheduler is a predefined set of instructions for varying search step sizes during model training. This paper introduces a new logarithmic method that applies harsh restarts of the step size during stochastic gradient descent. Cyclical log annealing implements the restart pattern more aggressively, which may allow the use of greedier algorithms in the online convex optimization framework. The algorithm was tested on the CIFAR-10 image dataset and appeared to perform comparably to cosine annealing on large transformer-enhanced residual neural networks. Future experiments would involve testing the scheduler in generative adversarial networks and finding the best parameters for the scheduler with more experiments.
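For reference, the cosine annealing baseline with restarts that this abstract compares against can be sketched as follows. This shows the standard cosine rule with fixed-length cycles, not the paper's logarithmic schedule, whose exact formula is not given here; the function name and default values are assumptions.

```python
import math


def cosine_restart_lr(step, cycle_len, lr_max, lr_min=0.0):
    """Cosine annealing with warm restarts, using fixed-length cycles.

    The rate starts at `lr_max`, decays along a half cosine towards `lr_min`,
    and jumps back to `lr_max` every `cycle_len` steps.
    """
    t_cur = step % cycle_len  # position inside the current cycle
    cosine = math.cos(math.pi * t_cur / cycle_len)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


if __name__ == "__main__":
    for step in range(0, 25, 5):
        print(step, round(cosine_restart_lr(step, cycle_len=10, lr_max=0.1), 4))
```

The jump back to `lr_max` at the start of each cycle is the restart; a more aggressive scheme would change how quickly the rate falls within a cycle.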


DeepSeek-VL: Towards Real-World Vision-Language Understanding

arXiv.org Artificial Intelligence

We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions: Data Construction: We strive to ensure our data is diverse, scalable and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content (expert knowledge, textbooks), aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction-tuning dataset accordingly. The fine-tuning with this dataset substantially improves the model's user experience in practical applications. Model Architecture: Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024) within a fixed token budget, while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks. Training Strategy: We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To ensure the preservation of LLM capabilities during pretraining, we investigate an effective VL pretraining strategy by integrating LLM training from the beginning and carefully managing the competitive dynamics observed between vision and language modalities. Starting with a focus on text, we gradually adjust the ratio to facilitate a balanced integration of both modalities.


Gradient Informed Proximal Policy Optimization

arXiv.org Artificial Intelligence

We introduce a novel policy learning method that integrates analytical gradients from differentiable environments with the Proximal Policy Optimization (PPO) algorithm. To incorporate analytical gradients into the PPO framework, we introduce the concept of an α-policy that stands as a locally superior policy. By adaptively modifying the α value, we can effectively manage the influence of analytical policy gradients during learning. To this end, we suggest metrics for assessing the variance and bias of analytical gradients, reducing dependence on these gradients when high variance or bias is detected. Our proposed approach outperforms baseline algorithms in various scenarios, such as function optimization, physics simulations, and traffic control environments. Our code can be found online: https://github.com/SonSang/gippo.


A Visual Guide to Learning Rate Schedulers in PyTorch

#artificialintelligence

Neural networks have many hyperparameters that affect the model's performance. One of the essential hyperparameters is the learning rate (LR), which determines how much the model weights change between training steps. In the simplest case, the LR value is a fixed value between 0 and 1. However, choosing the correct LR value can be challenging. On the one hand, a large learning rate can help the algorithm to converge quickly.
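To make the role of the learning rate concrete, the sketch below shows a plain gradient step and a step-decay schedule in pure Python. The decay rule mirrors the one used by PyTorch's `torch.optim.lr_scheduler.StepLR` (multiply the rate by `gamma` every `step_size` epochs), but the function names and default values here are assumptions chosen for illustration.

```python
def sgd_step(weight, grad, lr):
    """One gradient-descent update: the LR scales how far the weight moves."""
    return weight - lr * grad


def step_decay_lr(epoch, base_lr=0.1, step_size=10, gamma=0.5):
    """Step decay: multiply the base rate by `gamma` every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


if __name__ == "__main__":
    # Minimise the toy loss (w - 3)^2, whose gradient is 2 * (w - 3).
    w = 0.0
    for epoch in range(30):
        lr = step_decay_lr(epoch)
        w = sgd_step(w, 2.0 * (w - 3.0), lr)
    print(w)
```

Early epochs use the larger rate for fast progress, while later epochs take smaller steps as the weight approaches the minimum.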